Redwine exploration by Thuy Quach

Abstract

Why do some red wines taste better than others? Is it just because the wine tasters say so, or is there another way to tell? Can we tell what makes a great or a bad wine from its chemical properties? And if so, under what conditions is the quality of red wine the best?

This is what we are going to explore: the relationship between chemical properties and wine quality.

The analysis includes: data structure, statistical summary, distribution plots, boxplots of each variable vs. quality, correlation matrix and scatter plots, final plots, further exploration of the strongly correlated variables, and reflections.

Dataset

The data set used in this analysis can be found here: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt.


Summary of the data

First, let’s count the number of samples in the wine data:

## [1] 1599

There are 1599 samples.

Then, let’s explore all the variables.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

X is the data entry number and quality is the output variable of the analysis, which leaves 11 chemical properties as input variables. The data is in wide format.

What about the structure of the data?

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Quality is stored as an integer, as is the index X. All other variables are numeric.

A statistical summary of the data is shown below.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Quality ranged from 3 to 8. Residual.sugar, chlorides, free.sulfur.dioxide and total.sulfur.dioxide had very wide ranges, with maximum values far above their third quartiles. Do these variables influence wine quality?
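To put a number on how wide these ranges are, one rough sketch is to compare each variable's maximum with its minimum, using the values printed in the summary above. The data frame and column names below are illustrative and are not part of the original analysis.

```r
# Min and max values copied from the summary() output above
ranges <- data.frame(
  variable = c("residual.sugar", "chlorides",
               "free.sulfur.dioxide", "total.sulfur.dioxide"),
  min = c(0.9, 0.012, 1, 6),
  max = c(15.5, 0.611, 72, 289)
)

# Ratio of the largest to the smallest observed value for each variable
ranges$max_over_min <- ranges$max / ranges$min
ranges
```

A max/min ratio is only one way to describe spread; comparing the maximum with the third quartile would highlight the long right tails instead.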

Univariate Analysis

Distribution of individual variables by histogram and density:

First, let us explore the distribution of each variable using ggplot.

The data is in wide format, which makes it difficult to draw plots of multiple variables in R. Therefore, I reshaped the data into long format.

# reshape data into long format
# (melt() is assumed here to come from the reshape2 package)
library(reshape2)
long_data <- melt(redwine, id.vars = c("X", "quality"))
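To illustrate what the wide-to-long reshape does, here is a minimal base R sketch on a toy data frame built from the first three rows of the structure output above. It does not use melt(); the names toy, measure_vars and long_toy are hypothetical.

```r
# Toy wide data: first three rows of a few columns from the str() output above
toy <- data.frame(
  X = 1:3,
  quality = c(5L, 5L, 5L),
  pH = c(3.51, 3.20, 3.26),
  alcohol = c(9.4, 9.8, 9.8)
)

# Columns to stack; X and quality stay as identifiers
measure_vars <- setdiff(names(toy), c("X", "quality"))

# One block of rows per measured variable, bound together into long format
long_toy <- do.call(rbind, lapply(measure_vars, function(v) {
  data.frame(X = toy$X, quality = toy$quality,
             variable = v, value = toy[[v]])
}))
long_toy
```

Each wine now occupies one row per chemical property, which is the shape facet_wrap(~ variable) expects.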

All variables

# plot the distribution and density
dist_plot <- ggplot(long_data, aes(x=value)) +  
  geom_histogram(aes(y= ..density..), 
                 binwidth=0.05, colour="green", fill="white") + 
  geom_density(color = "red", alpha = 0.2)

# iterate plot 
dist_plot + facet_wrap(~ variable, scales = "free") 
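One thing to keep in mind with the plot above is that a single binwidth of 0.05 is applied to variables on very different scales: density spans only about 0.014 in total, while total.sulfur.dioxide spans 283. The sketch below computes a separate bin width per variable from the ranges in the summary output. The rule of about 30 bins per variable is an assumption for illustration, not the choice made in the original analysis.

```r
# Min and max copied from the summary() output above
var_min <- c(density = 0.9901, pH = 2.74, total.sulfur.dioxide = 6)
var_max <- c(density = 1.0037, pH = 4.01, total.sulfur.dioxide = 289)

# Hypothetical rule: split each variable's range into about 30 bins
n_bins <- 30
bin_width <- (var_max - var_min) / n_bins
bin_width
```

These widths could then be passed to geom_histogram() when each variable is plotted on its own.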

Some of the variables, such as density, pH and fixed.acidity, seem to follow a normal distribution, while others, such as residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and alcohol, are right-skewed.

Quality

Most of the wine samples had a quality of 5 or 6. Let’s get the actual number.

# calculate the % of wine with quality 5 and 6
100*count(subset(redwine, quality == 5 | quality == 6))/length(redwine$quality)
##          n
## 1 82.48906

82.49% of the wines had a quality of 5 or 6.
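The same percentage can be sketched in base R as the mean of a logical vector. Since the raw data is not reproduced here, the quality vector below is a stand-in rebuilt from the group sizes reported in the rating summary later in this document (63, 1319 and 217 wines); the individual scores within each group are placeholders.

```r
# Stand-in quality vector: 63 wines rated 4 or lower, 1319 rated 5 or 6,
# and 217 rated 7 or higher (scores within each group are placeholders)
quality <- rep(c(4L, 5L, 7L), times = c(63, 1319, 217))

# mean() of a logical vector is the proportion of TRUE values
pct_5_or_6 <- 100 * mean(quality %in% c(5L, 6L))
round(pct_5_or_6, 2)
```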

Data correlations

Let us run a correlation matrix with ggpairs to see which chemical properties have strong relationships with wine quality and with each other. It was difficult to plot ggpairs on all variables at once because the space allotted to the plot couldn’t hold 12^2 panels, so I created three groups and made sure that the variable “quality” (column 13) was present in each.

We learned that, as a rule of thumb, a correlation above 0.3 in absolute value is meaningful and one above 0.7 is pretty strong. Let us see whether any appear in the results below.

group1 <- ggpairs(redwine[c(13, 2:5)])
group1

The correlation coefficient between quality and volatile.acidity was -0.391, between citric.acid and fixed.acidity 0.672, and between citric.acid and volatile.acidity -0.552.

group2 <- ggpairs(redwine[c(13, 6:8)])
group2

The correlation coefficient between total.sulfur.dioxide and free.sulfur.dioxide was 0.668.

group3 <- ggpairs(redwine[c(13, 9:12)])
group3

The correlation coefficient between quality and alcohol was 0.476, and between pH and density -0.342.
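As a quick sanity check on the three column groups used above, the sketch below confirms that every chemical property (columns 2 to 12) lands in exactly one group and that quality (column 13) is in all of them. The list name groups is illustrative.

```r
# Column positions: 2:12 are the chemical properties, 13 is quality
groups <- list(
  group1 = c(13, 2:5),
  group2 = c(13, 6:8),
  group3 = c(13, 9:12)
)

# Every chemical property should appear in exactly one group
covered <- sort(unlist(lapply(groups, function(g) setdiff(g, 13)),
                       use.names = FALSE))
all(covered == 2:12)

# quality (column 13) should appear in every group
all(sapply(groups, function(g) 13 %in% g))
```

One consequence of this split is that pairs of properties from different groups (for example a column from group1 against a column from group3) are never plotted against each other.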

Bivariate analysis

Which chemical properties were correlated with each other?

Citric.acid and fixed.acidity

Total.sulfur.dioxide and free.sulfur.dioxide

pH and density:

Which chemical properties influence wine quality?

From the above correlation analysis, I found that only alcohol and volatile.acidity had correlation coefficients with quality larger than 0.3 in absolute value. Since we are interested in what makes the best wine, it is important to also consider other chemical properties that may have some impact.

Let’s look at the correlation of each chemical property with quality.

##                             [,1]
## fixed.acidity         0.12405165
## volatile.acidity     -0.39055778
## citric.acid           0.22637251
## residual.sugar        0.01373164
## chlorides            -0.12890656
## free.sulfur.dioxide  -0.05065606
## total.sulfur.dioxide -0.18510029
## density              -0.17491923
## pH                   -0.05773139
## sulphates             0.25139708
## alcohol               0.47616632
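The one-column layout of the output above is what cor() returns when it is given a set of columns as x and a single vector as y. The sketch below shows that call on a toy data frame made of the first ten rows of three properties from the structure output, so its numbers will not match the full-data correlations; the exact call used in the original analysis is not shown in this document.

```r
# First ten rows of three columns, copied from the str() output above
toy <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of each property (columns of x) with quality (y);
# the result is a one-column matrix with one row per property
cor_with_quality <- cor(x = toy[, c("volatile.acidity", "sulphates", "alcohol")],
                        y = toy$quality)
cor_with_quality
```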

Six chemical properties (volatile.acidity, total.sulfur.dioxide, pH, free.sulfur.dioxide, density, chlorides) were negatively correlated with quality, which suggests that higher levels of these properties are associated with lower-rated wine. Among them, volatile.acidity had the strongest relationship, with a correlation of -0.391. Sulphates, residual.sugar, fixed.acidity, citric.acid and alcohol were positively correlated with quality. Among these, sulphates, citric.acid and alcohol had the strongest relationships, with correlations of 0.251, 0.226 and 0.476 respectively.
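The reading of the correlation table can be reproduced with a few lines of R. The values are copied from the output above; the vector name cor_quality is illustrative, and the 0.3 cut-off is the rule of thumb mentioned earlier.

```r
# Correlations with quality, copied from the output above
cor_quality <- c(
  fixed.acidity = 0.12405165, volatile.acidity = -0.39055778,
  citric.acid = 0.22637251, residual.sugar = 0.01373164,
  chlorides = -0.12890656, free.sulfur.dioxide = -0.05065606,
  total.sulfur.dioxide = -0.18510029, density = -0.17491923,
  pH = -0.05773139, sulphates = 0.25139708, alcohol = 0.47616632
)

# Properties whose correlation magnitude exceeds the 0.3 rule of thumb
above_threshold <- names(cor_quality)[abs(cor_quality) > 0.3]
above_threshold

# Number of properties that are negatively correlated with quality
sum(cor_quality < 0)

# Rank all properties by strength of association, ignoring sign
cor_quality[order(abs(cor_quality), decreasing = TRUE)]
```

Ranking by absolute value puts alcohol, volatile.acidity, sulphates and citric.acid at the top, which are the four properties examined in more detail in the rest of this section.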

Correlation of chemical properties vs. wine quality by boxplots:

From the boxplots, it looked like alcohol, sulphates, volatile.acidity and citric.acid might have an impact on the quality of wines. The results were consistent with the previous correlation analysis.

Let’s zoom in on the plots of these chemical properties.

Alcohol

As wine quality increased from 3 to 8, the average alcohol content increased, except at quality 5. Wines of quality 5 also had many outliers.

Let’s compare the distributions of alcohol for different wine qualities.

The distributions of alcohol were similar and almost normal for all wine qualities except 5, where the distribution was much narrower.

Let’s see the summary of alcohol for wines of quality 5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9

The mean alcohol for wines of quality 5 was 9.90.

Let’s compare it with the other qualities.

quality_vs_alcohol <- redwine %>%
  group_by(quality) %>%
  summarize(avg_alcohol = mean(alcohol)) %>%
  arrange(avg_alcohol)

quality_vs_alcohol
## Source: local data frame [6 x 2]
## 
##   quality avg_alcohol
##     (int)       (dbl)
## 1       5    9.899706
## 2       3    9.955000
## 3       4   10.265094
## 4       6   10.629519
## 5       7   11.465913
## 6       8   12.094444

The average alcohol increased from 9.955 to 12.094 (1.2 times) when wine quality increased from 3 to 8, except for quality 5, where the average alcohol was 9.900.
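The grouped mean computed with group_by() and summarize() above can also be sketched in base R. The example below uses only the first ten rows of alcohol and quality from the structure output, so its means differ from the full-data table; the variable names are illustrative.

```r
# First ten rows of alcohol and quality, copied from the str() output above
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Split alcohol by quality level, take the mean of each piece,
# then sort from lowest to highest mean
avg_alcohol <- sort(sapply(split(alcohol, quality), mean))
avg_alcohol
```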

Citric.acid

As wine quality increased from 3 to 8, the average citric.acid increased.

Let’s compare the distributions of citric.acid for different wine qualities.

We could see the mean of citric.acid shift to the right as wine quality increased.

Let’s summarize and arrange the mean of citric.acid.

quality_vs_citric.acid <- redwine %>%
  group_by(quality) %>%
  summarize(avg_citric.acid = mean(citric.acid)) %>%
  arrange(avg_citric.acid)

quality_vs_citric.acid
## Source: local data frame [6 x 2]
## 
##   quality avg_citric.acid
##     (int)           (dbl)
## 1       3       0.1710000
## 2       4       0.1741509
## 3       5       0.2436858
## 4       6       0.2738245
## 5       7       0.3751759
## 6       8       0.3911111

It was clear that the average value of citric.acid increased from 0.171 to 0.391 (2.3 times) when quality increased from 3 to 8.

Sulphates

As wine quality increased from 3 to 8, the average sulphates increased.

Let’s compare the distributions of sulphates for different wine qualities.

We could see that the distributions of sulphates were similar and that the mean of sulphates shifted to the right as wine quality increased.

Let’s summarize and arrange the mean of sulphates.

quality_vs_sulphates <- redwine %>%
  group_by(quality) %>%
  summarize(avg_sulphates = mean(sulphates)) %>%
  arrange(avg_sulphates)

quality_vs_sulphates
## Source: local data frame [6 x 2]
## 
##   quality avg_sulphates
##     (int)         (dbl)
## 1       3     0.5700000
## 2       4     0.5964151
## 3       5     0.6209692
## 4       6     0.6753292
## 5       7     0.7412563
## 6       8     0.7677778

It was clear that the average value of sulphates increased from 0.570 to 0.768 (1.3 times) when quality increased from 3 to 8.

Volatile.acidity

As wine quality increased from 3 to 8, the average volatile.acidity decreased.

Let’s compare the distributions of volatile.acidity for different wine qualities.

We could see that the distributions of volatile.acidity were similar and that the mean of volatile.acidity shifted to the left as wine quality increased.

Let’s summarize and arrange the mean of volatile.acidity.

quality_vs_volatile.acidity <- redwine %>%
  group_by(quality) %>%
  summarize(avg_volatile.acidity = mean(volatile.acidity)) %>%
  arrange(avg_volatile.acidity)

quality_vs_volatile.acidity
## Source: local data frame [6 x 2]
## 
##   quality avg_volatile.acidity
##     (int)                (dbl)
## 1       7            0.4039196
## 2       8            0.4233333
## 3       6            0.4974843
## 4       5            0.5770411
## 5       4            0.6939623
## 6       3            0.8845000

It was clear that the average value of volatile.acidity decreased from about 0.88 at quality 3 to about 0.42 at quality 8 (a factor of about 2.1). The lowest average, 0.404, occurred at quality 7.
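The "times" figures quoted for the four properties are ratios of the quality 8 mean to the quality 3 mean. A short worked check, using the means printed in the tables above (the data frame name means is illustrative):

```r
# Mean values at quality 3 and quality 8, copied from the tables above
means <- data.frame(
  property = c("alcohol", "citric.acid", "sulphates", "volatile.acidity"),
  q3 = c(9.955, 0.171, 0.570, 0.8845),
  q8 = c(12.094444, 0.3911111, 0.7677778, 0.4233333)
)

# Fold change from quality 3 to quality 8 (values below 1 indicate a decrease)
means$fold_change <- means$q8 / means$q3
means
```

For volatile.acidity the ratio is below 1 because the mean falls as quality rises; taking its reciprocal gives the factor by which it decreased.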

Summary of univariate and bivariate analysis:

There were 1599 samples of wine with quality ranging from 3 to 8. 82.49% of the wines had a quality of 5 or 6.

There were notable correlations among the chemical properties: citric.acid with fixed.acidity (0.672), citric.acid with volatile.acidity (-0.552), total.sulfur.dioxide with free.sulfur.dioxide (0.668), and pH with density (-0.342).

There were also correlations between some chemical properties and quality: moderate for volatile.acidity (-0.391) and alcohol (0.476), and weaker for sulphates (0.251) and citric.acid (0.226).

Grouping quality into three groups using a new variable

# turn the data into a data.table
wine_table <- data.table(redwine)

# add new rating variable
wine_table[, rating := ifelse(quality <=4, "bad",
                       ifelse(quality >=5 & quality <=6, "good",
                       ifelse(quality >=7, "very good", NA)))]

Let’s summarize the wine by rating.

wine_table %>%
  group_by(rating) %>%
  summarize(n_obs = n())
## Source: local data table [3 x 2]
## 
##      rating n_obs
##       (chr) (int)
## 1      good  1319
## 2 very good   217
## 3       bad    63

So, there were 217 very good wines, 1319 good wines and 63 bad wines.
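The nested ifelse() calls above can also be written with cut(), which bins a numeric vector in one step. This is an alternative sketch, not the code used in the analysis; it is applied here to the six quality scores that occur in the data.

```r
# Every quality score that occurs in the data
quality <- 3:8

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf)
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "good", "very good"))

data.frame(quality, rating)
```

Unlike the character column produced by ifelse(), cut() returns a factor whose levels run from "bad" to "very good", so grouped summaries and plot legends come out in that order.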

Multivariate Plots Section

Citric.acid and fixed.acidity correlation, color-coded by quality

Citric.acid and fixed.acidity correlation, color-coded by the new rating variable

Average of all variables by quality

## Source: local data frame [6 x 12]
## 
##   quality avg_alcohol avg_citric.acid avg_sulphates avg_volatile.acidity
##     (int)       (dbl)           (dbl)         (dbl)                (dbl)
## 1       5    9.899706       0.2436858     0.6209692            0.5770411
## 2       3    9.955000       0.1710000     0.5700000            0.8845000
## 3       4   10.265094       0.1741509     0.5964151            0.6939623
## 4       6   10.629519       0.2738245     0.6753292            0.4974843
## 5       7   11.465913       0.3751759     0.7412563            0.4039196
## 6       8   12.094444       0.3911111     0.7677778            0.4233333
## Variables not shown: avg_fixed.acidity (dbl), avg_pH (dbl),
##   avg_residual.sugar (dbl), avg_density (dbl), avg_total.sulfur.dioxide
##   (dbl), avg_free.sulfur.dioxide (dbl), avg_chlorides (dbl)

Average of all variables by rating

## Source: local data table [3 x 12]
## 
##      rating avg_alcohol avg_citric.acid avg_sulphates avg_volatile.acidity
##       (chr)       (dbl)           (dbl)         (dbl)                (dbl)
## 1       bad    10.21587       0.1736508     0.5922222            0.7242063
## 2      good    10.25272       0.2582638     0.6472631            0.5385595
## 3 very good    11.51805       0.3764977     0.7434562            0.4055300
## Variables not shown: avg_fixed.acidity (dbl), avg_pH (dbl),
##   avg_residual.sugar (dbl), avg_density (dbl), avg_total.sulfur.dioxide
##   (dbl), avg_free.sulfur.dioxide (dbl), avg_chlorides (dbl)

We could clearly see the trend that the higher the wine rating, the higher both avg_fixed.acidity and avg_citric.acid were. This is consistent with fixed.acidity and citric.acid being strongly correlated with each other (correlation coefficient of 0.672), and with both being positively correlated with quality (0.124 and 0.226 respectively).

Total.sulfur.dioxide and free.sulfur.dioxide, color-coded by quality and by rating

ggplot(quality_vs_total_variables, aes(x = avg_total.sulfur.dioxide, y = avg_free.sulfur.dioxide, 
       color = as.factor(quality))) +
  geom_point()

ggplot(rating_vs_total_variables, aes(x = avg_total.sulfur.dioxide, y = avg_free.sulfur.dioxide, 
       color = as.factor(rating))) +
  geom_point()

The plots showed the correlation between free.sulfur.dioxide and total.sulfur.dioxide. Interestingly, wine quality was best in the middle range of both properties (about 14 and 35 respectively). This suggests that low concentrations of these chemicals are associated with worse-tasting wine, while too much of them is also associated with lower quality. It is also consistent with the two chemicals having only weak linear correlations with quality.

pH and density, color-coded by quality:

pH and density were correlated, but only moderately (-0.342).



Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection